ABSTRACT

The usefulness of studying thought processes in the 
construct validation of ability tests was examined using a sample 
consisting of 343 Canadian senior high school students. Four levels 
of probing were used by the interviewer to examine the students' 
thinking processes while taking the Test on Appraising Observations: 
(1) think aloud; (2) immediate recall — examinees were asked why they 
chose that answer; (3) criteria probe — examinees were asked whether a 
feature of a test item determined the answer; and, (4) principle 
probe — examinees were also asked whether answer choice was based on 
particular general principles. Two scores were derived: performance 
scores or number right, and thinking scores indicating the quality of 
thinking displayed. Analyses were concerned with three questions: (1) 
whether the verbal reports accurately portray the thinking that takes 
place; (2) whether thinking concurrent with reporting differs from 
thinking in testing situations in which reports are not elicited; and 
(3) whether thinking subsequent to reporting is different from 
thinking subsequent to non-reporting testing situations. Analyses of 
variance were conducted to determine the effects and interactions of 
interview group, interviewer, sex, and grade level. Seven types of 
verbal activity were noted during the interviews. (GDC)

This report describes an experiment designed to explore
the usefulness of studies of thinking processes in the construct 
validation of ability tests. A study of thinking processes 
is one in which an attempt is made to gain information on the 
mental processes which people use to perform tasks, that is, 
to describe the strategies and kinds of information which lead
to performance. A study of thinking processes typically does
not lead to the direct observation of mental processes (though
the possibility of direct observation cannot be ruled out,
especially in the future), but allows more trustworthy inferences
to be made about their nature than can be obtained through the 
examination of performance at the strictly task level. A study 
of thinking processes represents a concerted attempt to "look 
beneath the surface" of directly observable task behaviour to 
discover its underlying causes. This requires both the invention 
and justification of appropriate probing techniques and the 
imaginative hypothesizing of mechanisms and processes which 
can account for what is found using these techniques. 

Under the description above, studies of thinking processes 
go hand in hand with the construction of theories of human mental 
abilities, and are explicitly designed to facilitate this 
activity by increasing the reliability of inferences from data 
to theory. The process of construct validation of ability tests 
has also been linked to theory construction, so it is natural 
to think that studies of thinking processes are relevant to 
construct validation. If construct validation is conceived 
(at least in part) as the identification of the mental processes
which underlie task performance, as has been done by Susan
Embretson (Whitely) (1983) in her conception of construct
representation, then the relevance of studies of thinking
processes to construct validation can be more readily seen.
Evidence for the construct validity of an ability test is
obtained to the extent that good performance can be explained
by examinees' following sound thinking processes, and to the
extent that poor performance can be explained by deviations
from such processes. Studies of thinking processes can provide
the information needed to judge the soundness of thinking 
processes used.

I. THE PROBLEM

The discussion thus far has argued that in principle
the information gathered in studies of thinking processes ought
to be relevant to construct validation. This is true only if
the information on people's test thinking which these studies
yield is an accurate reflection of the thinking which would
have taken place had the people taken the test outside the
study. Studies of thinking processes typically require that
subjects provide introspective reports of the progress of their
thinking, or provide reasons for their performance. It is not
known whether such requirements alter thinking from what would
have taken place under testing conditions in which such verbal 
reports are not provided. 

The study addressed the following two general questions: 

1. Are introspective reports of thinking reflective 
of the thinking that actually takes place? More 
specifically, does the accuracy of introspective 
reports of thinking depend upon the manner in which
the report is elicited? 

2. Do introspective reports of thinking reflect the 
thinking that takes place in testing situations in 
which only outcomes of thinking, and not the thinking
itself, are reported? That is, does the elicitation
of introspective reports change the course of thinking
from what it would have been without the elicitation?

II. THEORETICAL PERSPECTIVE

Belief in the potential usefulness of studies of thinking 
processes is often motivated by the perspective of scientific 
realism. Fundamental to the philosophy of scientific realism
is the view that scientific investigation is aimed towards,
among other things, the identification of the underlying causes
of directly observable phenomena. The postulation of
theoretical entities is taken to be speculation about the real
constitution of the world, speculation which is then tested
through further exploration. 

Scientific realism is often contrasted with
instrumentalism or positivism, views which do not concede the
reality of theoretical entities. On these accounts theoretical
entities are taken to be the imaginative speculations of
scientists, useful fictions designed to bring coherence to a
vast array of unconnected observables. To this view the
scientific realist retorts: "But if theoretical entities are
supposed to be the underlying causes of what is directly observed
(a view which both instrumentalists and positivists usually
espouse), how can they serve this role if they are fictions
in the minds of scientists? Causes make things happen, a
function which fictions in the minds of scientists are singularly
unsuited to perform!"

The goal of construct validation is to discover the causes 
of performance on tests. When the tests are mental ability
tests the attempt in test design is to make a test such that
the designated mental ability is the cause of test performance,
and construct validation is conducted to determine the extent
to which this has been achieved. Mental abilities are assumed
to underlie performance in the sense that they are not currently
directly observable, but must be inferred from what can be
observed. The observation of performance alone typically leads
to highly ambiguous inferences about underlying abilities because
competing possible causes of performance cannot be ruled out.
Some method is needed to push back the bounds of the observable
beyond typical sorts of test performance, a task for which
studies of thinking processes are particularly well suited,
and a task which must be accomplished for science to proceed
(Norris, forthcoming).

The procedure is somewhat complicated by the fact that 
we are not able to specify the nature of mental abilities in 
advance of doing the scientific investigation. It is precisely 
this knowledge which the investigation is designed to achieve. 
Imagine, then, wanting to construct a test of mental ability
"X". Not knowing the nature of ability X, how is it to be
recognized that X and not some other ability is the cause of test
performance? At the current stage of the science of mental 
abilities, ability X is likely defined in terms of directly 
observable performances. Therefore, to take the performances 
alone as evidence that X is the operative cause of those 
performances is to reason in a circle. The issue is complicated, 
and cannot be resolved without the interplay of scientific and
philosophical reasoning (Norris, 1985). Two pertinent questions
which must be answered are: (i) Can we imagine how the
operation of the postulated mental processes would produce the
performances? and (ii) Are we willing to conclude that these
processes are manifestations of the ability we are trying to
test? There are no fixed rules for answering such questions,
but it is clear they require deep thought by those thoroughly
immersed in the field.

III. HISTORY OF STUDIES OF PROCESS

Studies of thinking processes have experienced a long
history of endorsement by test validation theorists. For
example, Cronbach (1971, p. 474) suggested that such studies
can usually amplify the meaning of constructs. This endorsement
is contrasted with very few reported examples of research of 
this type. 

One of the earliest and most extensive studies in this 
tradition was conducted by B.S. Bloom and L.J. Broder in 1950
on the thinking processes of college students solving certain
test problems. Bloom and Broder believed that inferences from
test behaviour to underlying mental processes are untrustworthy 
unless they rely on explicit exploration of those processes. 
They knew that many mental processes could lead to the same 
performance, and provided examples of sound thinking leading 
to incorrect performance and unsound thinking resulting in
correct solutions. Their approach to gaining more direct 
information on mental processes was to have examinees think 
aloud while answering questions on a test. They found that 
by first giving practice on thinking aloud while solving some 
simple multiplication problems subjects were able to provide 
more detailed reports. One of their major conclusions was that 
"the method of thinking aloud served ... to yield relatively 
consistent and meaningful data from the majority of subjects" 
(1950, p. 90).

Bloom and Broder failed to question the meaningfulness
of the information contained in subjects' reports for situations
in which the subjects would not have been asked to think aloud.
Just because the introspective reports were consistent and
meaningful does not mean that they were reflective of thinking
that would have taken place in other sorts of situations. If
requiring people to think aloud as they work through test
questions makes their thinking substantially different from
what it would have been had they not thought aloud, then the 
information gathered in the validation study is not relevant 
to testing situations in which verbal reports of thinking are 
not sought. 

In another study R.P. Kropp (1956) examined the
relationship between thinking processes revealed in oral problem
solving and the solutions provided to the problems. Like Bloom
and Broder, Kropp concluded that verbal reports of thinking
reveal a great variety of mental processes leading to the same
answer. Kropp also concluded that the technique is useful for
exposing ambiguities and hidden cues in test items. Still,
the question of what is learned from think aloud contexts about
normal test taking contexts, and the question of the accuracy
of think aloud reports were no better understood after this
study.

C. McGuire (1963) reported on an attempt to help improve
the construction and interpretation of an examination by using
experts' introspective judgements of the mental processes
required to answer questions on it, and students' reports of
the processes they followed while taking the test. She found
that the method had a fair degree of usefulness in designing
tests of more complex mental processes and in bringing student
assessment into better agreement with the objectives of the 
instruction. Without further elaboration she also remarked 
that it became apparent that "the interview [technique] did 
not sufficiently simulate an examination situation to allow 
sound conclusions to be drawn" (p. 9). One cannot be certain, 
but I assume she meant sound conclusions about whether the 
results were applicable to situations in which introspective 
reports were not gathered. If this is what she intended it 
is a puzzling and disappointing fact that she did not explain 
her position further. At the same time, it is important that 
she recognized a problem which still needed to be explored. 

In 1964 J. A. Connolly and M.J. Wantman reported a study 
which they considered to be an improvement upon the original 
one in this tradition by Bloom and Broder. Like Bloom and 
Broder, they assumed that inferences about the nature of 
reasoning processes drawn from typical item analysis statistics 
are tenuous at best (p. 59). In the study, subjects (of which 
there were only 9) were told to think aloud, reporting all 
thoughts that might cross their minds during their attempts 
to respond to a set of test items. No probing other than this
non-directive instruction was used. As in the Bloom and Broder 
study, instances were found of good thinking coupled with
incorrect answers and poor thinking with correct answers.
Adequacy of thinking was rated in accord with a model of quality
thinking on the test items. The overall conclusion was that
the technique is useful for pretesting items in the construction
of a test.

H. Schuman (1966) used the technique of probing people's
reasons for the answers they chose on a test. The probing was
conducted after the test was completed, with each individual
probed on a randomly selected set of items from all those
contained on the test. Responses were evaluated on a five point
scale, from a score of "1" given for an explanation which was
quite clear and led to accurate prediction of the answer chosen,
to "5" for an explanation that was very unclear and did not
support any prediction about the answer chosen. Total scores
for individuals over all items on which they were probed were
calculated, with a lower score indicating a better
understanding of the items by the individual. Total item scores for
each item over all individuals who were probed on it were also 
calculated, and indicated the group's understanding of the 
individual items. The qualitative information contained in 
the analysis of the verbal reports was also used to help 
understand "more precisely what [the analyst] is measuring — 
which is, after all, the final goal of 'validity'." 

A colleague and I used think aloud protocols in the 
development of a critical thinking test on appraising 
observations (Norris and King, 1984). Our desire was to conduct
the interviews in a fundamentally nonleading fashion. We wished
to influence students' thinking as little as possible, realizing
that just asking them to think aloud and placing them alone
with a stranger might have effects in themselves. Still, it
seemed on occasion that interrupting a student's narrative might 
be more beneficial than not, particularly when the interruption 
was merely to clarify the ambiguous referent of a pronoun, or 
to point out obvious reading errors. Although we did not wish 
to rush examinees, to cut off reasoning by inadvertent signals, 
or to endorse or criticize particular reasoning attempts, we 
did wish to obtain records of reasoning which were as complete 
as possible. To fulfill this aim it was often necessary to 
probe beyond the initial instruction to think aloud. This 
probing was done only after examinees had chosen their answers 
to questions and had finished reporting on their thinking. 
Even in these follow-up stages probing was as nonleading as 
possible, merely echoing already reported thoughts or asking 
to explain choices of answers a little more fully. It was in
the context of developing this test that the present study was
conceived.

IV. RELEVANT STUDIES IN NON-TESTING CONTEXTS

The essential nature of studies of thinking processes
is that they are attempts to extract information from people's
memories, usually their short term or very recent memories.
This fact suggests that research on creation of and extraction 
of information from memory would be relevant. Much of this 
research can be found in studies of the use of verbal reports 
as data in the information processing tradition and in studies 
of eyewitness testimony. 

Verbal Reports as Data 
Much of the work on the trustworthiness of verbal reports 
of mental processes is reviewed in one of three recent articles 
and a recent book (Ericsson and Simon, 1980, 1984; Nisbett and 
Wilson, 1977; Smith and Miller, 1978). The essence of the 
Nisbett and Wilson report is that people have little or no 
introspective access to the things which stimulate their 
cognitive processes. Ericsson and Simon and Smith and Miller 
are critical of this conclusion, and claim that people do have 
dependable access to their mental processes in certain
situations.

Nisbett and Wilson conclude three things: (i) people 
often cannot accurately report the effects of certain stimuli 
on their responses to problems requiring higher order thinking; 
(ii) when people do report on such stimuli they often do not
search their memories to discover what the stimuli were, but
rather appeal to plausible hypothetical mechanisms which they
accept a priori; and (iii) when people are correct about the
stimuli affecting their responses they have coincidentally
employed a hypothesis which happens to be correct.

Not all of the studies reviewed by Nisbett and Wilson 
can be described here for the number is quite large. They relied
on evidence from the cognitive dissonance literature, the
self-perception attribution literature, the learning without
awareness literature, and the literature on problem solving,
among other fields. From studies on cognitive dissonance and
self-perception they concluded that people can change their
attitudes without any apparent awareness of such change, and
can be motivated by things of which they are not aware. They
argued that results from studies of problem solving suggest
that experimental subjects are usually not aware of
experimentally manipulated factors which have influenced their
responses. They also review a series of studies designed to
demonstrate people's inability to report accurately on the 
effects of experimentally controlled stimuli on their responses. 
For example, people are not aware of the effect that position 
on the rack has on their selection of a garment; are not aware 
of the effect of people's personalities on their assessment 
of those people's appearance; are unable to accurately report 
on the effect of distractions on their reactions to such things
as a film; and are unable to accurately rate the effect of being 
assured there was no danger on their willingness to subject 
themselves to such things as electric shock.

Nisbett and Wilson do suggest situations in which accurate
verbal reports can be expected. These are characterized by
an available influential stimulus, a stimulus which is a
plausible cause of the response, and a lack of other plausible
causes of the response. The experimental situations upon which
they base their conclusions do not meet all of these conditions.
In particular, experiments are situations in which the
influential stimulus is not available to the subjects because
it is "systematically and effectively [hidden] from them by
[the] experimental designs" (Smith and Miller, 1978, p. 356).
It is on this point that Smith and Miller criticize Nisbett's
and Wilson's conclusions most severely, because they apply only
to situations (experimentally controlled ones) in which the
outcome, subjects' unawareness of what was influencing their
thinking, is what would naturally be expected. Nisbett's and 
Wilson's analysis does not inform us of whether in other 
situations people's mental processes are more accessible to 
them. 

Another limitation of the Nisbett and Wilson analysis 
arises from the depth of the mental processes which they 
examined. They are dealing with subtle mental processes such 
as those which govern the formation of attitudes, which stimulate 
insightful solutions, and which bias evaluations. What kinds 
of mental processes are these? Are they of the sort that people 
can be aware of them? If they are the sort of process that,
say, governs such things as the human heart beat or regulates
breathing, then it is not surprising that people cannot access
them. As Smith and Miller point out, people are not even able
to report on the mental processes involved in less deep, but
yet routine, processing such as that involved in producing
answers to well-learned multiplication tables.

Ericsson and Simon (1980, 1984) discuss the 
trustworthiness of verbal reports on mental processes in light 
of a theory of thinking conceived as information processing. 
They conclude that instructions to verbalize do not change the 
course of cognitive processing, but merely slow it down, when 
subjects are verbalizing information that would normally be 
available to them in short-term memory. Specific and directive 
probes alter cognitive processing, however, as do requests to 
supply motives and reasons. This conclusion is particularly 
relevant for test validation contexts, since it is the provision
of reasons for answers that is often sought. This conclusion
suggests that information about test validity gathered in
interview contexts might not be applicable to testing contexts
in which interviewing was not done. With regard to the
completeness of verbal reports of thinking, Ericsson and Simon
conclude that certain types of things tend to be omitted.
Processes that are so well learned that they have become
automatic tend not to be reported, and often subjects are able
to behave in accord with rules without being able to verbalize
them.

One of the values of the Ericsson and Simon work is that
it gives specific information on the situations in which one
can expect verbal reports to be trustworthy, and on the ones
in which they are justifiably mistrusted. In particular, their
research indicates that the less leading the probe employed
the more accurate the information obtained, and that more
information with an overall lower trustworthiness can be obtained 
with more leading probes. However, it is not legitimate to 
assume that this research answers all the questions for testing 
situations. Testing contexts are sufficiently different from 
the ones in which information processing research is conducted 
that it is reasonable to assume that memory retrieval and 
information processing demands might also differ. In particular, 
taking tests is a situation that carries with it certain 
assumptions about how one should try to perform, how the results 
reflect upon the individual, and so on. These assumptions are 
probably different from those that go along with being involved 
in a study in which tests are not given, and possibly lead to 
different influences on performance. 

Eyewitness Testimony 
Eyewitness testimony is often contained in verbal reports 
of what people can remember, or claim to be able to remember. 
These reports are often given in response to instructions of 
one sort or another. Reports of what examinees are thinking 
when responding to a test are similar sorts of things. In one 
situation people search their memories for recollections of
what they have observed; in the other for what they have
thought. It is reasonable to believe that the recall processes
in both situations are related. Thus the eyewitness testimony
literature, which contains information on the factors which
affect the accuracy of reports of observations, is pertinent
to the question of the accuracy of reports of thinking while
taking tests. The degree of pertinence is tempered by the
dissimilarities between the two situations: in one, recall
of the recognition of an external event takes place, whereas
in the other recall of an internal event occurs; in one, memory
is probed about events in the more distant past, whereas in
the other the memory is of events in the very recent past.

The most relevant eyewitness testimony research for the
present study concerns the effect of different types of
questioning on the accuracy of reports. Three categories of
questions have been studied: (i) those eliciting free reports
(for example, "Tell us all that you saw"); (ii) those eliciting
controlled reports (for example, "Give us a description of what
your assailant was wearing"); and (iii) those eliciting
alternate-choice reports (for example, "Did your attacker have
dark or light hair?") (Loftus, 1979, p. 90). Two general 
conclusions can be drawn on the basis of many independent tests 
of the influence of these types of questioning techniques. 
The first is that free reports tend to be more accurate than 
any other type of report, controlled reports rank next in
accuracy, and alternate-choice reports have the lowest degree
of accuracy. The second conclusion is that the amount of
information obtained increases in the opposite direction: free
reports contain the least amount of information, controlled
reports somewhat more, and alternate-choice reports the most
of all. So then, free reports give a relatively lesser amount
of relatively more accurate information, and alternate-choice
reports a relatively greater amount of relatively less accurate
information. Independent support for these results has been
given by many investigators including Clifford and Scott (1978),
Dale, Loftus and Rathbun (1978), Harris (1973), Hilgard and
Loftus (1979), Lipton (1977), Loftus and Palmer (1974), and
Marquis, Marshall and Oskamp (1972). The results are also
largely consistent with the theory and evidence offered by
Ericsson and Simon.

As with the research on verbal reports as data, it is
not legitimate to assume that the results of eyewitness testimony
research can be applied directly to the testing situation.

Eliciting reports of thinking on tests is different from 
eliciting recollections of observed events, and there is no 
research which explores the relevance of these differences to 
factors affecting the accuracy of both types of report. In 
addition, testing is a different social context from involvement 
in psychological experiments, and it is not known how this fact 
would influence recall from memory.

V. METHOD

Sample

Five senior high schools were chosen on the east coast
of Newfoundland, Canada. The communities in which the schools
were located ranged from one-industry communities with fewer
than 1000 people to a somewhat larger town of about 3000. The
total sample consisted of 343 students, which included all of
the students in grades 10, 11, and 12 in four of the schools,
and about half of those in the other. This sample provided
for a broad range of student abilities. In addition, although
all the schools were in small communities, they were within
commuting distance of the capital city and indeed many of the
teachers commuted every day. Thus, the schools experienced
little trouble in attracting highly qualified teachers. In 
addition, the students in these schools scored at or above the 
national average on standardized measures of achievement. 

Procedure 

A completely randomized factorial design was used to
study the effect of various levels of probing on examinees'
thinking processes while they worked through Part A of the Test
on Appraising Observations (Norris and King, 1983). Four levels
of probe were used: (i) Think Aloud, in which examinees were
asked to report all they were thinking as they worked through
the items; (ii) Immediate Recall, in which examinees were asked
to tell why they had chosen the answer they did; (iii) Criteria
Probe, in which a feature of each test item was mentioned and
examinees were asked whether those features made any difference
to the answers they chose; and (iv) Principle Probe, which was
a criteria probe with the additional question of whether choices
of answer were based upon particular general principles. The
probes vary in degree of "leadingness" (according to standard
concepts of what it is to be a "leading" question), and also vary
in the task required. The first level of probe gives
considerable leeway for examinees to report as they see fit, 
while the subsequent ones ask for particular sorts of information 
and are thus more directive of the task to be carried out. 

An associate and I each selected students according to 
the order they appeared on class lists. They were taken from
their classes one at a time and randomly assigned to one of
the experimental groups, either one of the probe groups or a
control group. Students falling into the probe groups were
asked to first work through items 1-15 while they were
interviewed. As they worked through each question they were
asked to mark their answers on the answer sheet and either to
think aloud, or to tell why they had chosen their answer, or
to respond to either the criteria or principle probe. The
reports were tape recorded. The remaining 13 items on Part
A were then completed by the students working privately in a
more normal testing situation. Students in the control group
were not interviewed, and were asked to work privately through
all 28 items on Part A while marking their answers on the answer
sheet. The raw data thus consisted of answers marked on the
answer sheets and tape recorded protocols for those in the
experimental groups. 



Two sets of scores were derived from the raw data. The
first set consisted of performance scores, numbers of items
right according to the key provided with the test (Norris and
King, 1985). Total number of questions correct was calculated
for items 1-15 and for items 16-28. The second set of data
consisted of thinking scores. Thinking scores were determined
for items 1-15 for all students in the experimental groups.
Scores reflected the quality of thinking displayed in the
protocols on a scale of 0-3 for each item (Norris and King,
1984). Total thinking scores for items 1-15 were calculated.
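
The arithmetic behind the two sets of scores can be sketched as
follows. This is a minimal illustration and not the scoring
procedure actually used: the dictionary layout of answers, key, and
ratings, the function names, and the restriction to whole-number
ratings are assumptions made for the example.

```python
def performance_scores(answers, key):
    """Count correct answers separately for items 1-15 and items 16-28.

    `answers` and `key` are hypothetical dicts mapping item number
    (1-28) to the marked response and the keyed response.
    An unanswered item (missing from `answers`) counts as incorrect.
    """
    first = sum(1 for i in range(1, 16) if answers.get(i) == key[i])
    second = sum(1 for i in range(16, 29) if answers.get(i) == key[i])
    return first, second


def total_thinking_score(ratings):
    """Sum the 0-3 quality-of-thinking ratings for items 1-15.

    `ratings` is a hypothetical dict mapping item number to a rating.
    Assumption: ratings are whole numbers from 0 to 3.
    """
    total = 0
    for item in range(1, 16):
        rating = ratings[item]
        if rating not in (0, 1, 2, 3):
            raise ValueError(f"rating for item {item} must be 0, 1, 2, or 3")
        total += rating
    return total
```

Under these assumptions a total thinking score lies between 0 and
45, and the two performance totals lie between 0 and 15 and between
0 and 13 respectively.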

Data Analysis

The following three questions were addressed in a series
of quantitative and qualitative analyses:

1. Do verbal reports of thinking on tests accurately
portray thinking that takes place?

2. Is thinking concurrent with reporting different from
thinking in testing situations in which reports of thinking
are not elicited?

3. Is thinking subsequent to reporting different from
thinking subsequent to testing situations in which reports of
thinking are not elicited?

Quantitative Analyses

Question 1: Do verbal reports of thinking on tests
accurately portray thinking that takes place? Verbal reports
of thinking can be useful in the validation of ability tests
which are to be used in situations where such reports are not
elicited only when the reporting does not shift thinking from
the course it would have followed had the reporting not taken 
place. However, even if this condition is satisfied, a further 
issue remains. Do verbal reports of thinking give an accurate 
portrayal of the course thinking follows, regardless of whether 

tlie reports changes the course of thinking? This is 

the issue addressed by question I, and is the issue raised by 
the first general question posed at the outset of this report. 

Trying to answer this question raises a vexing issue, 
for which only a compromise solution is currently available. 
The issue involves the availability of a criterion of accuracy 
of reports of thinking. In some sort of ideal situation, what 
the scientist would like to do is follow the course of thinking 
independently of the person engaging in it, by having a "window 
into the brain" or some such access. Then the person's verbal 
reports of thinking could be compared with the scientist's 
independent observation of it, and the match between them would 
give a measure of the accuracy of the verbal reports. No such 
ideal situation can currently be created, nor even approximated. 
In addition, while recognizing that a complex of interconnected 
experiments might provide inferential access to people's 
thinking, the source of information upon which the scientist 
must rely most extensively is the thinker's own verbal reports 
of thinking. But it is the accuracy of such verbal reports 
as portrayals of the course of thinking that is at issue in 
this study. Some compromise must be sought. 

Indicators of accuracy have been suggested from time 
to time. Ericsson and Simon (1980) suggest that the 
investigator's judgement of the completeness of a subject's 
reasoning can indicate accuracy. If the reported reasoning 
lacks something the investigator has good reason to believe 
was needed, then the report can be judged incomplete to this 
extent. While useful, this criterion depends on the imagination 
and insight of the investigator and for this reason is likely 
to be applied unevenly across situations. Schuman (1966) 
recommends gauging the accuracy of verbal reports by the extent 
to which they lead to correct predictions of the subject's choice 
of answer. To the extent that correct predictions can be made, 
the reports are judged accurate. One problem with this approach 
is that there are factors other than thinking which affect 
subjects' choices of answers. Thus, even perfectly accurate 
reports of thinking would not necessarily lead to accurate 
predictions of responses. In addition, sometimes accurate 
predictions can be made independently of any knowledge of 
subjects' thinking. 

Given no clear best way to proceed, I decided to take 
the thinking scores obtained by examinees in the Think Aloud 
group to be the criterion against which to compare reports from 
the other groups. This approach assumes that differences in 
thinking processes among the groups would show up in differences 
in thinking scores, and assumes rather than studies the accuracy 
of the Think Aloud reports. However, it is generally conceded 
(Ericsson and Simon, 1980, 1984; Loftus, 1979) that free reports 
such as those given in the Think Aloud group are the most 
accurate of all. The main issue concerns not the accuracy of 
what is reported, but the completeness. That which is reported 
is generally assumed to be trustworthy, but it is also assumed 
that aspects of thinking are not reported in such situations. 

With Thinking Score as the dependent variable, and the 
Think Aloud group taken as the control, a 4 x 3 x 2 x 2 fixed 
effects analysis of variance was performed using the SPSS MANOVA 
procedure, with Interview Group, Grade Level, Interviewer, 
and Sex as independent variables. This analysis allowed between 
5 and 6 observations per cell given the 271 subjects in these 
four interview groups. The Non-interview group was excluded 
from this analysis. 

Question 2: Is thinking concurrent with reporting 
different from thinking in testing situations in which reports 
of thinking are not elicited? If it can be concluded that verbal 
reports provide accurate portrayals of thinking that is taking 
place, the first condition for the usefulness of studies of 
thinking processes to test validation has been met. The second 
condition, raised by the second general question at the beginning 
of this report, requires that eliciting the verbal reports does 
not itself affect the course of thinking. If it does, then 
the usefulness of studies of process would be limited to testing 
situations in which verbal reports of thinking are also 
elicited. Such types of tests are possible, and maybe even 
desirable. But given the time required for their administration, 
and the attendant costs, they are not likely to achieve wide 
use. 

If eliciting thinking reports alters the course of 
thinking, then this should be manifested in different 
performances between those being interviewed and those taking 
the test without being interviewed. With Total Performance 
Score on items 1-15 as the dependent variable, and the No Probe 
group as the control, a 5 x 3 x 2 x 2 fixed effects analysis 
of variance was performed with Interview Group, Grade Level, 
Interviewer, and Sex as the independent variables. This allowed 
between 5 and 6 observations per cell using the total sample 
of 343 subjects. 
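
The "between 5 and 6 observations per cell" figure can be checked with simple arithmetic. The sketch below assumes subjects were spread roughly evenly over the cells, so it reports an average rather than the actual cell counts; the function name is introduced here only for illustration.

```python
from math import prod


def average_cell_size(levels, n_subjects):
    """Return (number of cells, average observations per cell) for a crossed design."""
    cells = prod(levels)
    return cells, n_subjects / cells


# Question 1 design: 4 interview groups x 3 grade levels x 2 interviewers x 2 sexes
q1_cells, q1_per_cell = average_cell_size((4, 3, 2, 2), 271)

# Question 2 design: the No Probe group adds a fifth interview-group level
q2_cells, q2_per_cell = average_cell_size((5, 3, 2, 2), 343)

print(f"Question 1: {q1_cells} cells, {q1_per_cell:.2f} per cell on average")
print(f"Question 2: {q2_cells} cells, {q2_per_cell:.2f} per cell on average")
```

Both averages fall between 5 and 6, in line with the figures stated for the two designs.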

Question 3: Is thinking subsequent to reporting different 
from thinking in testing situations in which reports of thinking 
are not elicited? It is widely believed that in addition to 
illustrating what they know, people often acquire new knowledge 
while taking tests. This fact needs to be taken into account 
in the interpretation of test scores, although knowledge is not 
yet sufficient for doing this well. However, if in presenting 
verbal reports of thinking examinees learn different things 
from when they do not provide such reports, then the usefulness 
of studies in the former context is diminished for the validation 
of tests used in the latter context. Question 3 thus also 
addresses the issue raised by the second general question. 

With Total Performance Scores on items 16-28 as the 
dependent variable and the No Probe group as the control, a 
5 x 3 x 2 x 2 fixed effects analysis of variance was performed 
using the same independent variables used in the analyses of 
Question 2. In order to simplify interpretations, in the 
analyses for all three questions the four-way interaction mean 
square was combined with the error term. 
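
Pooling the four-way interaction with the error term changes the residual degrees of freedom, and the bookkeeping can be sketched as follows for the Question 1 design (4 x 3 x 2 x 2, 271 subjects). The sketch assumes a fully crossed design with no empty cells; the helper name is hypothetical.

```python
from itertools import combinations
from math import prod


def interaction_df(levels, order):
    """Total degrees of freedom for all effects involving `order` factors."""
    return sum(prod(n - 1 for n in combo) for combo in combinations(levels, order))


levels = (4, 3, 2, 2)   # interview group, grade level, interviewer, sex
n_subjects = 271

main_df = interaction_df(levels, 1)        # 3 + 2 + 1 + 1
two_way_df = interaction_df(levels, 2)
three_way_df = interaction_df(levels, 3)
four_way_df = interaction_df(levels, 4)    # 3 * 2 * 1 * 1

within_cells_df = n_subjects - prod(levels)        # error df before pooling
pooled_residual_df = within_cells_df + four_way_df  # four-way term folded into error

print(main_df, two_way_df, three_way_df, pooled_residual_df)
```

Under these assumptions the two-way and three-way totals each come to 17 and the pooled residual to 229, which agrees with the degrees of freedom listed in Table IB.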
Qualitative Analyses 

Quantification often entails the loss of some "richness" 
of information. In particular, representing protocols by a 
series of thinking scores as was done in this experiment is 
bound to lose some of the information contained in the original 
verbal reports. As an alternative approach to answering Question 
1, I conducted a qualitative analysis of a random sample of 
40 (stratified by treatment group) of the total sample of 271 
interviews. The following seven categories of verbal moves 
for describing the protocols resulted from this analysis: (i) 
Reference to Details - either recalling a factual detail given 
in an item prior to one currently being worked on, recalling 
such a prior detail incorrectly, or stating a detail in the 
current item; (ii) Asking Rhetorical Questions - posing questions 
which appeared to be directed to the examinee himself or herself 
rather than to the interviewer; (iii) Making Self-Evaluations 
- either evaluating judgements or conclusions which had been 
previously explicitly stated, or evaluating ones which had not 
been verbalized; (iv) Constructing Supporting Assumptions - 
either making detailed factual assumptions specific to the 
current item, or making more generalized assumptions of broad 
principles of appraisal or causal laws covering more than the 
situation in the current item; (v) Using Attention Control 
Phrases - either making comments about the stage of progress 
reached in reasoning through the problem (Let's see, Where was 
I, etc.), or commenting on the direction reasoning should proceed 
(Wait now); (vi) Interacting with the Experimenter - directing 
comments or questions to the experimenter; and (vii) Pausing 
- either making verbal inflections (Ahhh, Hmmm, etc.) or being 
silent. 

Protocols were coded according to the seven categories 
and occurrences were accumulated for each category across the 
forty subjects. No sophisticated statistical analysis was 
performed. At this stage the data were taken to be purely 
exploratory, and were examined merely for general trends with 
a view to more systematic exploration in the future. The 
question asked was whether interview group membership affected 
the course of thinking in ways that were not detectable by 
differences in thinking scores, but were detectable by the above 
seven categories. 

VI. RESULTS 
Question 1 

Table IA gives the main effect means for each level of 
the four factors examined. An examination of the table indicates 
differences on the order of 1 point or less. Table IB gives 
the analysis of variance summary for the four factors. The 
analysis revealed no significant interaction or main effects. 
In particular, the Interview Group main effect was nonsignificant. 
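
As a reading aid for Table IB, each F value appears to be the effect's mean square divided by the residual mean square. The sketch below recomputes those ratios from the mean squares printed in the table; the dictionary layout and variable names are my own, and no significance test is attempted.

```python
# Mean squares as printed in Table IB (Question 1).
MS_RESIDUAL = 15.9

mean_squares = {
    "Interview Group": 13.8,
    "Interviewer": 26.0,
    "Sex": 32.4,
    "Grade": 42.9,
    "Two Way Interactions": 24.4,
    "Three Way Interactions": 18.1,
}

# F ratio for each source: effect mean square over residual mean square.
f_ratios = {source: ms / MS_RESIDUAL for source, ms in mean_squares.items()}

for source, f in f_ratios.items():
    print(f"{source:<24} F = {f:.3f}")
```

The recomputed ratios match the printed F column to within rounding of the mean squares.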

The qualitative analysis of the 40 randomly chosen 
protocols also revealed little difference among interview groups, 
which was the factor of primary concern in this study. Given 
the qualitative and speculative nature of the seven categories 
which were developed to describe the protocols, no sophisticated 
statistical analysis of the data was performed. Rather, the 
results were examined for obvious trends which would indicate 
some interesting differences to explore more rigorously. No 
such differences were found. Table IC is a contingency table 
of the seven categories against interview group. While there 
are clear differences between the protocol categories, with 
some having occurrences on the order of hundreds of times and 
others on the order of tens of times, there are no glaring 
differences in trend between interview groups. The categories 
register occurrences with the same order of magnitude across 
all groups. It did not seem reasonable to try to pry more than 
this conclusion from these data. 

Question 2 

Table IIA contains the main effect means for each level 
of the four factors examined. Visual inspection of the table 
indicates that all differences are small, being on the order 
of about 0.5 on the performance scale. Table IIB gives the 
analysis of variance summary for the four factors. The analysis 
showed no significant interaction effects, and a significant 
main effect for Interviewer. 

Question 3 

Table IIIA contains the main effect means. Inspection 
shows that there are only very small differences for all 
factors. Table IIIB contains the analysis of variance summary 
information and shows significant effects for Interviewer, Sex 
and Grade Level. There are no significant interaction effects, 
and the effects for Interview Group are nonsignificant. 

Table IA 

Main Effect Means: Question 1 
Accuracy of Verbal Reports 

Factor             Level               Mean 

Interview Group    Think Aloud          7.9 
                   Immediate Recall     9.2 
                   Criteria Probe       8.8 
                   Principle Probe      9.0 
Interviewer        A                    8.1 
                   B                    9.3 
Sex                Male                 9.2 
                   Female               8.3 
Grade              Level I              8.2 
                   Level II             8.6 
                   Level III            9.5 

Table IB 

Analysis of Variance Summary: Question 1 

Source                       df      MS        F 

Main Effects 
  Interview Group             3     13.8     0.868 
  Interviewer                 1     26.0     1.64 
  Sex                         1     32.4     2.04 
  Grade                       2     42.9     2.70 
Two Way Interactions         17     24.4     1.53 
Three Way Interactions       17     18.1     1.14 
Residual                    229     15.9 

Table IC 

Analysis: Question 1 

                              Think    Immed.   Crit.    Princ. 
                              Aloud    Recall   Probe    Probe 

Reference to Details           104      139       99      139 
Rhetorical Questions            16        9        2        5 
Self-Evaluations                45       24       39       43 
Constructing Assumptions       178      228      214      227 
Attention Control               26       25       15       19 
Interact with Experimenter      19        9       12       13 
Pausing                        499      387      424      380 

Table IIA 

Main Effect Means: Question 2 
Thinking Concurrent With Reporting 

Factor             Level                 Mean 

Interview Group    No Probe (Control)     7.8 
                   Think Aloud            8.0 
                   Immediate Recall       8.3 
                   Criteria Probe         7.9 
                   Principle Probe        7.6 
Interviewer        A                      7.6 
                   B                      8.2 
Sex                Male                   7.7 
                   Female                 8.0 
Grade              Level I                7.8 
                   Level II               7.7 
                   Level III              8.1 

Table IIB 

Analysis of Variance Summary: Question 2 

Source                       df      MS        F 

Main Effects 
  Interview Group             4     5.40     1.02 
  Interviewer                 1     17.8     3.35* 
  Sex                         1     3.70     0.695 
  Grade                       2     4.56     0.857 
Two Way Interactions         21     5.20     0.977 
Three Way Interactions       22     4.75     0.893 
Residual                    290     5.32 

* p<0.01 

Table IIIA 

Main Effect Means: Question 3 
Thinking Subsequent to Reporting 

Factor             Level                 Mean 

Interview Group    No Probe (Control)     8.4 
                   Think Aloud            8.4 
                   Immediate Recall       [illegible] 
                   Criteria Probe         8.6 
                   Principle Probe        8.1 
Interviewer        A                      8.2 
                   B                      8.5 
Sex                Male                   8.0 
                   Female                 8.7 
Grade              Level I                7.8 
                   Level II               8.6 
                   Level III              8.8 

Table IIIB 

Analysis of Variance Summary: Question 3 

Source                       df      MS        F 

Main Effects 
  Interview Group             4     1.93     0.429 
  Interviewer                 1     12.9     2.88* 
  Sex                         1     32.3     7.19** 
  Grade                       2     34.5     7.70** 
Two Way Interactions         21     6.43     1.44 
Three Way Interactions       22     4.59     1.02 
Residual                    290     4.48 

*  p<0.05 
** p<0.01 

VII. DISCUSSION 
Question 1 

Do verbal reports of thinking on tests accurately portray 
the thinking that takes place? The results of this study show 
that the accuracy of reports in portraying the essential elements 
of the thinking process on a critical thinking test does not 
vary across a variety of probing techniques, from the nonleading 
elicitation of free reports to the leading elicitation of 
controlled reports. There were no significant differences in 
the quality of thinking as measured by Thinking Scores across 
the four levels of probe studied. In addition, the qualitative 
analysis of protocols revealed that there was no essential 
difference in the verbal moves used in reporting under different 
elicitation procedures. Both results suggest strongly that 
it is subjects' thinking and not how that thinking is elicited 
that controls what is reported. If this result can be 
substantiated, then it would seem that the accuracy of verbal 
reports of thinking on tests is not as sensitive to the type 
of probing as research in other contexts would indicate. 

The issue of the criterion of accuracy must always be 
kept in mind, though. There is no available technique, nor 
is there likely to be one in the near future, for gaining direct 
access to people's thinking processes independently of their 
introspective observations. To conduct this study, we assumed 
that the most accurate reports could be obtained from asking 
subjects to think aloud, with no further probes being made. 
This would seem to provide the least amount of interference 
possible while still eliciting the desired information. When 
compared to this group, the other groups provided equally 
accurate reports. The question remains about the degree to 
which free reports are an accurate reflection of thinking that 
takes place. Accuracy in this context is a function of two 
considerations: whether, as far as they go, reports accurately 
describe the thinking process, and whether reports go far enough 
in giving complete descriptions of the entire thinking process. 
It is doubtful that verbal reports of thinking are ever fully 
complete, for there appears to be much thinking for which we 
have little or no introspective access (Nisbett and Wilson, 
1977). There can be some confidence that what is reported is 
an accurate reflection of that aspect of the thinking process 
which is described. This study suggests that in testing contexts 
such as those employed, the degree to which a probe is leading 
does not affect the accuracy of thinking reports. 

Question 2 

Is thinking that occurs concurrent with reporting on 
thinking different from thinking in testing situations in which 
reports of thinking are not elicited? Regardless of the accuracy 
of verbal reports of thinking, if such reports are to be useful 
in the validation of ability tests in which reports of thinking 
are not sought, then eliciting them cannot alter the course 
of thinking. If the course of thinking is altered by having 
people report on their thinking, then this alteration could 
be revealed in altered performance. Thus, performances which 
are systematically similar provide evidence that thinking is 
similar, though of course there is no necessity that similar 
performance result from similar thinking. 

The results showed that there are no significant 
differences in performance on items 1-15 of the Test on 
Appraising Observations between those who reported their thinking 
while working on those items and those who worked on them alone 
while giving no reports of their thinking. This was true for 
all levels of probing, suggesting that probing did not alter 
thinking. If this suggestive result can be substantiated, then 
there are implications for the usefulness of this technique 
that extend beyond test validation contexts. The technique 
should also prove useful for conducting basic research into 
the nature of human reasoning, a use which has already been 
endorsed strongly by Ericsson and Simon (1984). 

Question 3 

Does the elicitation of reports of thinking have any effect 
on thinking which occurs subsequent to the elicitation? Such 
longer term effects could occur even if there are no immediate 
effects. In this study very long term effects were not 
examined. Rather, effects on performance were studied 
immediately after the reports were made. When subjects finished 
reporting their thinking on items 1-15 they were asked to work 
on their own on the remaining items on Part A of the test, items 
16-28. The results showed no significant performance differences 
between those students who had been probed on items 1-15 and 
those who had not been, again suggesting that no significant 
differences in thinking occurred. There seem, then, to be 
no effects carried over from the reporting sessions which are 
highly relevant to how students think and perform on similar 
tasks immediately thereafter. 



Whenever failure to reject the null hypothesis is the 
desirable result of an experiment, the power of the test to 
reject a false null hypothesis becomes an overriding concern. 
Was this experiment sufficiently powerful to detect any true 
differences which existed among the treatment groups? There 
are a number of reasons which make it highly plausible to believe 
that differences would have been detected had they been present 
in the population. The first turns on the fact that the 
treatments were considerably different from one another. For 
high school students, working alone on a test in a fashion they 
are well used to in school is quite a different situation 
from working in the presence of a stranger who is probing 
their thinking in a way that hardly ever happens in school. 
Thus, if elicitations of thinking have an effect on the course 
of thinking, then it should have been revealed in differences 
in performance between the interviewed and uninterviewed groups. 
In addition, the interview treatments themselves were highly 
different. The leading probes were quite leading in that they 
made explicit suggestions to students about what could have 
affected their choices of answers. It would have been an easy 
matter for students to conform to these suggestions. Instead, 
students would regularly deny that the suggested factor had 
anything to do with their thinking and proceed to explain how 
their choices were made. 

Another reason making the null results of this experiment 
plausible is that effects were sought from a number of different 
directions, but none were found in any of them. The quantitative 
analyses showed that no differences were detected either in 
the ratings of students' thinking or in ratings of their 
performance both during and after the interview sessions. In 
addition, the qualitative analysis showed that the same patterns 
of verbal moves were used by each treatment group. It is 
plausible to think that if differences existed they would have 
been detected by at least one of these methods. 

In addition, it must be noted that psychological research 
uncovers consistent effects using similar sorts of treatments 
in studies of eyewitness testimony. This does not mean that 
differences should have been found in this study, but it does 
mean that if differences existed they should have been detected. 
Of course the demand for an explanation for why no differences 
exist in the situation studied in this experiment arises at 
this point. Although it is highly tentative at this time, there 
is evidence which suggests that the results of psychological 
research on the evaluation of eyewitness testimony are not always 
substantiated in studies of the practice of juries in actual 
courtroom situations. For example, although psychological 
research conducted in laboratory contexts suggests that "jurors" 
place an unwarranted amount of confidence in eyewitness 
testimony, studies of real jurors show no such tendencies 
(McCloskey and Egeth, 1983). One possible explanation of this 
fact is that the gravity of the situation induces jurors to 
realize that being sceptical of evidence is important to 
maintaining the presumption of innocence of the accused. No 
such importance is attached to psychological experiments. 

It is possible that a similar sort of mechanism might 
have operated in the context of this experiment. The study 
required students to take a test, and in our society tests are 
typically treated seriously. Even when it is known that the 
results will have no long-term consequences for school grades 
or any such matter, it is highly probable that they will still 
be taken seriously. Not many students are likely to portray 
themselves as being less capable than they actually are by 
deliberately performing poorly. At least it has been my 
experience that students take very seriously the situations 
I present them. It is possible that this fact creates a certain 
resistance to being led by suggestive questions which resulted 
in the null result of the experiment. 

In addition to these considerations, an analysis of the 
statistical power of the experiment was performed using 
techniques described in Kirk (1968, pp. 107-108). The analysis 
requires the calculation of a parameter and the use of charts 
based upon a procedure by Tang (1938) which require the setting 
of a probability of Type I error and knowing the degrees of 
freedom for the treatment and error effects. The parameter 
is given by: 

$$\phi = \frac{\sqrt{\sum_{j=1}^{k} \beta_j^{2} / k}}{\sigma_\epsilon / \sqrt{n}}$$

where: 

$\sum_{j=1}^{k} \beta_j^{2}$ = sum of squared treatment effects 

$n$ = size of the jth sample 

$\sigma_\epsilon^{2}$ = error variance. 

For the purposes of the calculation, $((k-1)/n)(MS_{BG} - MS_{WG})$ was 
taken as an unbiased estimate of the sum of squared treatment 
effects, and $MS_{WG}$ as an unbiased estimate of the population 
error variance. With the probability of a Type I error set 
at 0.05 for each analysis, the results showed that the power 
of rejecting the null hypothesis when it was false was >0.97 
for Question 1, >0.96 for Question 2, and >0.99 for Question 
3. These results coupled with the earlier considerations make 
the null result of this experiment highly plausible. 
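
The calculation of the power parameter can be sketched as follows. The sketch assumes the parameter takes the form phi = sqrt(sum of squared treatment effects / k) / (error standard deviation / sqrt(n)), with k treatment levels and n observations per level, and it substitutes the mean-square estimates described above. The function name and the input values are hypothetical and are not the figures from this experiment; the chart lookup that converts phi to power is not reproduced.

```python
import math


def power_parameter_phi(k, n, ms_between, ms_within):
    """Estimate phi from mean squares (assumed form; see the lead-in).

    k          -- number of treatment levels
    n          -- observations per treatment level
    ms_between -- between-groups mean square
    ms_within  -- within-groups (error) mean square
    """
    # ((k - 1) / n) * (MS_between - MS_within) estimates the sum of squared effects.
    sum_sq_effects = (k - 1) / n * (ms_between - ms_within)
    if sum_sq_effects < 0:
        raise ValueError("estimate is negative when MS_between < MS_within")
    return math.sqrt(sum_sq_effects / k) / math.sqrt(ms_within / n)


# Hypothetical inputs chosen only to show the arithmetic.
phi = power_parameter_phi(k=4, n=20, ms_between=30.0, ms_within=10.0)
print(round(phi, 3))
```

Under this form the estimate reduces algebraically to sqrt(((k - 1) / k) * (F - 1)), where F is the ratio of the two mean squares, so it is defined only when F is at least 1.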
VIII. CONCLUSION 

This research points to a useful validation technique 
for the testing field. Studies of thinking processes using 
the verbal reports of examinees have always seemed an obvious 
way to gather data on the construct validity of tests. Such 
studies have long been known to be time consuming and expensive, 
but also their usefulness and justifiability have been uncertain. 
This study provides examples in the context of validating a 
critical thinking test of how such studies of process might 
be conducted using different questioning procedures. The results 
of the experiment indicate that the researcher did not have 
to be overly cautious about the "leadingness" of the questions 
used to elicit reports of thinking. Basically, examinees 
appeared not to be easily led when reporting on their thinking. 
Comparisons of the quality of thinking displayed in the verbal 
reports, of overall performance on the test, and of the verbal 
moves made while reporting showed no significant differences 
from one interview group to another. In addition, performance 
scores for the interview groups did not differ significantly 
from the noninterviewed control group. 

It is not known whether results similar to these would 
be found with all types of test items or with all types of 
content. Given the lack of knowledge in this area, prudence 
would suggest repeating this experiment for tests with other 
item types and in other content areas. If such were to be done, 
then the research reported here can serve as a prototype for 
these subsequent studies. 